Machine Learning for Data Linkage
Abstract
Data linkage traditionally uses deterministic and probabilistic methods. Alternatively, machine learning methods can be applied as classification algorithms, using the data to inform decisions. This project compared the quality, in terms of precision and recall, of traditional methods with that of selected machine learning methods when applied to a standard linkage problem.
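To make the classification framing concrete, the sketch below turns one candidate record pair into a feature vector that a classifier could label as match or non-match. The field names (`name`, `dob`, `postcode`), the `compare_pair` helper and the choice of similarity measure are illustrative assumptions, not details taken from the study.

```python
from difflib import SequenceMatcher


def compare_pair(rec_a: dict, rec_b: dict) -> list[float]:
    """Build a comparison vector for one candidate pair (hypothetical fields)."""
    # Fuzzy similarity on names, in [0, 1]; 1.0 means identical strings.
    name_sim = SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()
    ).ratio()
    # Exact agreement indicators for date of birth and postcode.
    dob_match = 1.0 if rec_a["dob"] == rec_b["dob"] else 0.0
    postcode_match = 1.0 if rec_a["postcode"] == rec_b["postcode"] else 0.0
    return [name_sim, dob_match, postcode_match]


# Invented example records, for illustration only.
a = {"name": "Jane Smith", "dob": "1980-02-01", "postcode": "AB1 2CD"}
b = {"name": "Jane Smyth", "dob": "1980-02-01", "postcode": "AB1 2CD"}
features = compare_pair(a, b)
```

A supervised model would be trained on such vectors with known match status, while an unsupervised model would have to infer the two classes from the vectors alone.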
Two supervised methods, gradient boosted trees (GBT) and a multiple layered perceptron classifier (MLPC), and one unsupervised method, a maximum entropy classifier (MEC), were implemented. The linkage of the England and Wales 2021 Census to the Census Coverage Survey (CCS) was used as a gold-standard (GS) linked dataset to provide training samples for the supervised methods as well as testing data for all methods. The F1 score (harmonic mean of precision and recall) was used to compare the performance of the models and to determine optimal parameters and thresholds.
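The F1 score and its use for choosing a threshold can be sketched as follows. This is a minimal illustration that assumes each candidate pair carries a model score and that the gold-standard links are available as a set of pairs; the function names and the toy data are hypothetical and do not come from the study.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Precision, recall and F1 of predicted links against gold-standard links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def best_threshold(scores: dict, gold: set, candidates: list) -> tuple:
    """Return (threshold, f1) for the candidate threshold with the highest F1."""
    best = (None, -1.0)
    for threshold in candidates:
        predicted = {pair for pair, s in scores.items() if s >= threshold}
        f1 = precision_recall_f1(predicted, gold)[2]
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Toy scores for four candidate pairs; three of them are true links.
scores = {("a1", "b1"): 0.95, ("a2", "b2"): 0.80,
          ("a3", "b9"): 0.60, ("a4", "b4"): 0.40}
gold = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
threshold, f1 = best_threshold(scores, gold, [0.3, 0.5, 0.7, 0.9])
```

Because F1 balances the two error types, a threshold chosen this way avoids favouring precision at the expense of recall, or the reverse.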
The Splink implementation of the Fellegi-Sunter method with Expectation Maximisation was used as a baseline for comparison.
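For readers unfamiliar with the baseline, the sketch below shows the general idea of estimating Fellegi-Sunter parameters by Expectation Maximisation. It does not use Splink's API. It is a simplified pure-Python illustration that assumes binary agree/disagree comparisons and conditional independence between fields; the function names, starting values and toy data are all assumptions.

```python
import math


def em_fellegi_sunter(gammas, n_iter=50, lam=0.1, m_init=0.9, u_init=0.1):
    """Estimate match proportion lam and per-field m/u probabilities by EM."""
    n_fields = len(gammas[0])
    m = [m_init] * n_fields  # P(field agrees | pair is a match)
    u = [u_init] * n_fields  # P(field agrees | pair is a non-match)
    eps = 1e-6               # keeps probabilities away from 0 and 1

    def clamp(p):
        return min(max(p, eps), 1 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        weights = []
        for g in gammas:
            p_match, p_non = lam, 1.0 - lam
            for k, agree in enumerate(g):
                p_match *= m[k] if agree else 1.0 - m[k]
                p_non *= u[k] if agree else 1.0 - u[k]
            weights.append(p_match / (p_match + p_non))
        # M-step: re-estimate parameters from the posterior weights.
        total_match = sum(weights)
        total_non = len(gammas) - total_match
        lam = clamp(total_match / len(gammas))
        for k in range(n_fields):
            agree_match = sum(w for w, g in zip(weights, gammas) if g[k])
            agree_non = sum(1.0 - w for w, g in zip(weights, gammas) if g[k])
            m[k] = clamp(agree_match / total_match)
            u[k] = clamp(agree_non / total_non)
    return lam, m, u


def match_weight(g, m, u):
    """Sum of log2 likelihood ratios over fields for one comparison vector."""
    return sum(
        math.log2(m[k] / u[k]) if agree else math.log2((1 - m[k]) / (1 - u[k]))
        for k, agree in enumerate(g)
    )


# Toy comparison vectors (1 = field agrees, 0 = field disagrees).
gammas = ([(1, 1, 1)] * 20 + [(1, 1, 0)] * 3 + [(0, 0, 0)] * 65
          + [(1, 0, 0)] * 4 + [(0, 1, 0)] * 4 + [(0, 0, 1)] * 4)
lam, m, u = em_fellegi_sunter(gammas)
```

Pairs are then classified by comparing their match weight with a chosen threshold, which is where the precision and recall trade-off is set.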
The models were trained on a sample of the GS and used to link the census and CCS data. All methods performed well, with MEC achieving the highest precision (99.79%) but the lowest recall (96.36%). The MLPC model achieved [...] (98.94%).
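To understand the implications of not retraining for each dataset, the models were also applied to a health dataset. The models were not retrained on the health data; instead, the models optimised on the GS were applied. [...] had [...] (96.51%), [...] (98.48%) and [...] (97.49%). With scores of 96.99% and 96.14% respectively, GBT and [...] were [...] far behind in performance, despite [...] being [...] data.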
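We have shown that machine learning methods can be applied effectively to data linkage problems. Unsurprisingly, the models perform best on the same data they were trained on. Further research into generic [...] may allow us to use both [...] in future linkage.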
Journal
Journal title: International Journal for Population Data Science
Year: 2023
ISSN: 2399-4908
DOI: https://doi.org/10.23889/ijpds.v8i2.2240